Course: Applied Geo-data Science at the University of Bern (Institute of Geography)
Supervisor: Prof. Dr. Benjamin Stocker
Adviser: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider
Further information: https://geco-bern.github.io/agds/
Do you have questions about the workflow? Contact the Author:
Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)
Matriculation number: 20-100-178
Reference: Report Exercise 4 (Chapter 8)
Modeling is an important aspect of data science. But there are different types of models with different levels of complexity. With a realistic problem, the following goals should be trained:
Understand the basics of regression and classification models.
Fit linear and logistic regression models in R.
Choose and calculate relevant model performance metrics.
Evaluate and compare regression models.
Detect data outliers.
Select best predictive variables.
In environmental and climate science, dozens of variables have been measured. With a model, we want to try to predict a target. For that, we use a linear model. Unfortunately, we do not know how many variables and which variables we should use. We only know that we try to find a formula like this:
\[ \hat{y}=a_0 +\sum_{n=1}^{N}\hat{a}_n*x_n \]
To build up the best linear regression model, we could, of course, simply try all combinations by permutation. But that would be a waste of resources. However, we cannot avoid a try-and-error approach. But we would like to reduce the number of combinations. We implement an algorithm that works as follows:
Let \(p\in[1,N]\) and a predictor. Let \(N\in[1,max(variable)]\). N could be all the variables we have or only a subset of them. Our algorithm will decide this (at the start, we did not know).
The algorithm takes p and creates a linear model. The p with the highest RSQ will be used as a predictor. We delete the predictor from our list because we can use it only once.
The algorithm takes p+1 and creates a linear model. The p with the highest RSQ will be used as a predictor. Select the model with the highest RSQ and calculate the AIC.
Iterate step 3 until the AIC of the model does not decrease anymore.
We have found the (presumably) optimal model.
For the first task, we apply only the first 2 steps because we want only a bivariate model. For the second task, we iterate step 3. If we do that, we might also be able to choose the best multivariate model.
The open-source program R-Studio (Version 2022.12.0+353) was used for all studies. The data handling in this program is done using the R language. We utilize the R-package “Tidyverse” and other packages to process the data effectively.
The following code chunk contains all packages we need. Important is the package “conflicted”. It enables us to choose if different functions have the same call, but do not make the same thing (a function in a certain package can have the same name as a function from another package). In this case, we will set preferences.
# Load the file
source("../../R/general/packages.R")
First, we get the data and save it in our repository. But we must do this only once. If the file exists, then we can read it directly. That is why we implemented an if-else statement.
# We want to know, if a certain file already exists
name.of.file <- "../../data/re_stepwise/FLX_CH_stepwise.csv"
# If do not exists such a file, create it!
if (!file.exists(name.of.file)){
# Access to the data
url <- "https://raw.githubusercontent.com/geco-bern/agds/main/data/df_for_stepwise_regression
.csv"
# Read the data directly from an URL
database.S1 <- read.table(url, header = TRUE, sep = ",")
# Write a CSV file in the repository
write_csv(database.S1,"../../data/re_stepwise/FLX_CH_stepwise.csv")
# Read the file
database <- read_csv("../../data/re_stepwise/FLX_CH_stepwise.csv")
# If exists such a file, read it only!
}else{database <- read_csv("../../data/re_stepwise/FLX_CH_stepwise.csv")}
To make a good decision about the predictors, we need as much information as possible. For that, we read our file as a data frame. After that, we want an overview of the data (what variables contain the data set, and what are the main statistical parameters?)
# Check wheater there are any missing values (We try with only the first six rows)
head(apply(database,2, function(x) is.na(x)))
## siteid TIMESTAMP TA_F SW_IN_F LW_IN_F VPD_F PA_F P_F WS_F TA_F_MDS
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## SW_IN_F_MDS LW_IN_F_MDS VPD_F_MDS CO2_F_MDS PPFD_IN GPP_NT_VUT_REF USTAR
## [1,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [2,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [3,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [4,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [5,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [6,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
# Load the function
source("../../R/general/function.visualize.na.values.R")
# Function call to visualize the NA
visualize.na.values.without.groups(database)
# Take an overview and main statistic parameters
summary(database)
## siteid TIMESTAMP TA_F SW_IN_F
## Length:23011 Min. :1996-01-01 Min. :-29.184 Min. : 0.156
## Class :character 1st Qu.:2002-12-01 1st Qu.: 0.431 1st Qu.: 52.933
## Mode :character Median :2007-02-15 Median : 7.286 Median :120.679
## Mean :2006-10-25 Mean : 6.950 Mean :137.633
## 3rd Qu.:2011-01-23 3rd Qu.: 13.313 3rd Qu.:215.243
## Max. :2014-12-31 Max. : 31.166 Max. :398.192
##
## LW_IN_F VPD_F PA_F P_F
## Min. :138.1 Min. : 0.000 Min. : 80.37 Min. : 0.000
## 1st Qu.:265.0 1st Qu.: 0.831 1st Qu.: 84.31 1st Qu.: 0.000
## Median :300.1 Median : 2.542 Median : 97.38 Median : 0.010
## Mean :295.8 Mean : 3.787 Mean : 93.47 Mean : 2.320
## 3rd Qu.:328.9 3rd Qu.: 5.239 3rd Qu.: 98.77 3rd Qu.: 1.749
## Max. :420.1 Max. :33.290 Max. :103.28 Max. :186.400
##
## WS_F TA_F_MDS SW_IN_F_MDS LW_IN_F_MDS
## Min. :0.106 Min. :-29.184 Min. : 0.156 Min. :138.1
## 1st Qu.:1.805 1st Qu.: 0.446 1st Qu.: 53.317 1st Qu.:269.5
## Median :2.387 Median : 7.356 Median :120.031 Median :303.3
## Mean :2.670 Mean : 6.987 Mean :137.482 Mean :299.4
## 3rd Qu.:3.286 3rd Qu.: 13.301 3rd Qu.:215.580 3rd Qu.:332.9
## Max. :9.928 Max. : 31.166 Max. :398.192 Max. :420.1
## NA's :325 NA's :341 NA's :11057
## VPD_F_MDS CO2_F_MDS PPFD_IN GPP_NT_VUT_REF
## Min. : 0.000 Min. :297.3 Min. : -0.249 Min. :-6.683
## 1st Qu.: 0.852 1st Qu.:372.1 1st Qu.: 91.040 1st Qu.: 0.802
## Median : 2.612 Median :383.7 Median :231.956 Median : 2.845
## Mean : 3.850 Mean :385.1 Mean :271.714 Mean : 3.503
## 3rd Qu.: 5.326 3rd Qu.:395.6 3rd Qu.:430.028 3rd Qu.: 5.600
## Max. :33.290 Max. :724.2 Max. :995.106 Max. :18.511
## NA's :768 NA's :290 NA's :4574 NA's :2426
## USTAR
## Min. :0.0749
## 1st Qu.:0.2850
## Median :0.3856
## Mean :0.4255
## 3rd Qu.:0.5244
## Max. :1.5941
## NA's :2468
We chose the best variable for our predictor. For that, we tried all variables and chose the one with the highest RSQ (see introduction).
# Access the outsourced function for the bivariate model
source("../../R/re_stepwise/function.bivariate.model.R")
# Function call
tibble.bi.model <- model.fitter(database, 16)
knitr::kable(tibble.bi.model)
| Target | Predictor | RR | AIC | Fit |
|---|---|---|---|---|
| GPP_NT_VUT_REF | siteid | 0.0458181 | 46735.32 | NO |
| GPP_NT_VUT_REF | TIMESTAMP | 0.0047922 | 47597.89 | NO |
| GPP_NT_VUT_REF | TA_F | 0.3871016 | 37619.26 | NO |
| GPP_NT_VUT_REF | SW_IN_F | 0.4306405 | 36102.41 | NO |
| GPP_NT_VUT_REF | LW_IN_F | 0.1676259 | 43919.98 | NO |
| GPP_NT_VUT_REF | VPD_F | 0.1908666 | 43337.05 | NO |
| GPP_NT_VUT_REF | PA_F | 0.0002403 | 47691.83 | NO |
| GPP_NT_VUT_REF | P_F | 0.0020894 | 47653.72 | NO |
| GPP_NT_VUT_REF | WS_F | 0.0082500 | 47526.24 | NO |
| GPP_NT_VUT_REF | TA_F_MDS | 0.3871137 | 37559.45 | NO |
| GPP_NT_VUT_REF | SW_IN_F_MDS | 0.4300828 | 36065.75 | NO |
| GPP_NT_VUT_REF | LW_IN_F_MDS | 0.1209379 | 25130.95 | NO |
| GPP_NT_VUT_REF | VPD_F_MDS | 0.1811937 | 42567.59 | NO |
| GPP_NT_VUT_REF | CO2_F_MDS | 0.0295658 | 47078.98 | NO |
| GPP_NT_VUT_REF | PPFD_IN | 0.4520702 | 29689.54 | BEST FIT |
| GPP_NT_VUT_REF | GPP_NT_VUT_REF | NA | NA | NA |
| GPP_NT_VUT_REF | USTAR | 0.0000316 | 45048.57 | NO |
There are meaningful scatterplots for bivariate LM models, and they give us a first impression. But we want to know more about the coefficients, the intercept, the distribution, and the outliers. For that, we use summary() and plot the model (see discussion).
# Calculate the probebly best multivariate model
best.bi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN, data = database)
# Load the function
source("../../R/re_stepwise/function.vis.bi.model.R")
# Plot the bivariate model
vis.bi.model(database)
# Overview about all important model parameter
# Model Call. We are also interested for the p-values
summary(best.bi.lm)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN, data = database)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0475 -1.3255 -0.3758 1.2780 12.3295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.72533 0.03057 23.73 <2e-16 ***
## PPFD_IN 0.01062 0.00009 117.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 16872 degrees of freedom
## (6137 observations deleted due to missingness)
## Multiple R-squared: 0.4521, Adjusted R-squared: 0.452
## F-statistic: 1.392e+04 on 1 and 16872 DF, p-value: < 2.2e-16
# Plot the model
plot(best.bi.lm)
What is happening with RSQ? Here we visualize the RSQ for each variable. We can easily see that PPFD_IN has the highest RSQ, and that is why we chose it as our predictor. But what if we want a multivariate regression? One might think that we should simply choose the variable with the second-highest RSQ. But we will see that this is generally not the case.
# Load the function
source("../../R/re_stepwise/function.vis.and.dev.bi.R")
# Function call
Vis_of_the_develop.bi.model(tibble.bi.model)
Here we want to answer the question of which predictors we should choose to get the probably best linear regression model. For that, we have written a function. It works as described in the introduction. We make two function calls. For the first, we use all variables but the target. This is, however, not necessarily the best way. We could say that the time factor and the location are irrelevant, so we drop these variables. We can see that the difference is huge.
# Access the outsourced function of the multivariate model
source("../../R/re_stepwise/function.multivariate.model.R")
# Function call with all column but target (column number 16)
multi.model.1 <- multivariate.model(database, c(16))
## [1] "Best model fit has AIC = 21762.1756977506"
## [1] "PPFD_IN" "LW_IN_F" "siteid" "VPD_F" "TA_F_MDS" "SW_IN_F"
## [7] "WS_F" "CO2_F_MDS" "P_F" "PA_F"
# Function call without the variables siteid, TIMESTAMP and GPP_NT_VUT_REF
multi.model.2 <- multivariate.model(database, c(1,2,16))
## [1] "Best model fit has AIC = 24469.2854126171"
## [1] "PPFD_IN" "LW_IN_F" "VPD_F" "TA_F_MDS" "SW_IN_F" "P_F"
## [7] "WS_F" "CO2_F_MDS" "PA_F"
# Overview about the models
knitr::kable(multi.model.1)
| Predictors | RSQ | AIC |
|---|---|---|
| PPFD_IN | 0.4520702 | 29689.54 |
| LW_IN_F | 0.5301754 | 27096.52 |
| siteid | 0.5981762 | 24464.35 |
| VPD_F | 0.6160337 | 23699.27 |
| TA_F_MDS | 0.6454925 | 22354.30 |
| SW_IN_F | 0.6513971 | 22072.88 |
| WS_F | 0.6564185 | 21830.06 |
| CO2_F_MDS | 0.6571345 | 21796.85 |
| P_F | 0.6577174 | 21770.14 |
| PA_F | 0.6579195 | 21762.18 |
| TIMESTAMP | 0.6579273 | 21763.80 |
knitr::kable(multi.model.2)
| Predictors | RSQ | AIC |
|---|---|---|
| PPFD_IN | 0.4520702 | 29689.54 |
| LW_IN_F | 0.5301754 | 27096.52 |
| VPD_F | 0.5746412 | 25420.80 |
| TA_F_MDS | 0.5881248 | 24879.25 |
| SW_IN_F | 0.5918328 | 24728.65 |
| P_F | 0.5945354 | 24618.55 |
| WS_F | 0.5963983 | 24542.84 |
| CO2_F_MDS | 0.5981106 | 24473.10 |
| PA_F | 0.5982491 | 24469.29 |
| TA_F | 0.5982635 | 24470.68 |
We visualize the development of the model. We can see that the model with more variables has a higher RSQ. That is not a surprise because RSQ will always increase if we add a variable. Important is the AIC. We can see that the AIC is lower for more variables (21762.2 vs. 24469.3). The RSQ is higher (0.6579 vs. 0.5983). It follows that model 1 is better than model 2. We see that also in the coefficients.
# Load the function
source("../../R/re_stepwise/function.vis.and.dev.multi.R")
# Function call
vis.and.dev.multi.model(multi.model.1)
vis.and.dev.multi.model(multi.model.2, c("[without siteid and TIMESTAMP]"))
There are no meaningful scatterplots for multidimensional models. But we want to know more about the coefficients, the intercept, the distribution, and the outliers. We make some calls by using summary(). We get the same values for the RSQ. This is proof that our algorithm works. We see that not all coefficients are significant.
# Calculate the probably best multivariate model
best.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F + siteid + TIMESTAMP,
data = database)
second.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F,
data = database)
# Overview about all important model parameter
# Model Call
summary(best.multi.lm)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
## SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F + siteid + TIMESTAMP,
## data = database)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3803 -1.1726 -0.0431 1.1877 11.4929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.833e+00 1.609e+00 -3.004 0.00267 **
## PPFD_IN 3.098e-03 4.833e-04 6.411 1.49e-10 ***
## LW_IN_F 7.823e-03 6.821e-04 11.469 < 2e-16 ***
## VPD_F -3.238e-01 7.285e-03 -44.455 < 2e-16 ***
## TA_F_MDS 1.659e-01 4.972e-03 33.361 < 2e-16 ***
## SW_IN_F 1.680e-02 9.998e-04 16.808 < 2e-16 ***
## P_F -1.290e-02 2.495e-03 -5.172 2.35e-07 ***
## WS_F -1.900e-01 1.355e-02 -14.020 < 2e-16 ***
## CO2_F_MDS -3.826e-03 8.153e-04 -4.693 2.71e-06 ***
## PA_F 5.735e-02 1.827e-02 3.139 0.00170 **
## siteidCH-Lae 8.313e-01 1.889e-01 4.400 1.09e-05 ***
## siteidFI-Hyy -5.900e-02 2.974e-01 -0.198 0.84274
## siteidFR-Pue -1.834e+00 2.782e-01 -6.590 4.52e-11 ***
## TIMESTAMP -6.450e-06 1.047e-05 -0.616 0.53783
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.905 on 16860 degrees of freedom
## (6137 observations deleted due to missingness)
## Multiple R-squared: 0.6579, Adjusted R-squared: 0.6577
## F-statistic: 2494 on 13 and 16860 DF, p-value: < 2.2e-16
summary(second.multi.lm)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
## SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F, data = database)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7477 -1.2228 -0.1153 1.1085 12.3241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5891656 0.4463032 -3.561 0.000371 ***
## PPFD_IN 0.0067222 0.0005163 13.021 < 2e-16 ***
## LW_IN_F 0.0152432 0.0006854 22.240 < 2e-16 ***
## VPD_F -0.3885851 0.0077764 -49.970 < 2e-16 ***
## TA_F_MDS 0.1035322 0.0049296 21.002 < 2e-16 ***
## SW_IN_F 0.0127810 0.0010750 11.889 < 2e-16 ***
## P_F -0.0233915 0.0026829 -8.719 < 2e-16 ***
## WS_F -0.1374643 0.0138279 -9.941 < 2e-16 ***
## CO2_F_MDS -0.0064341 0.0007675 -8.383 < 2e-16 ***
## PA_F 0.0077160 0.0032004 2.411 0.015922 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.064 on 16864 degrees of freedom
## (6137 observations deleted due to missingness)
## Multiple R-squared: 0.5982, Adjusted R-squared: 0.598
## F-statistic: 2790 on 9 and 16864 DF, p-value: < 2.2e-16
# Plot the model
plot(best.multi.lm)
It seems that the step-forward algorithm has been successfully implemented. However, it is important to choose the predictors very carefully because the final model could vary. The target, of course, must be removed before we can use the algorithm because if we use the target as a predictor, RSQ will always be 1. Additionally, we must be aware of linear combinations. We must always know what each variable stands for. The linear combination could disrupt our results. After that, we must decide whether the variable is a meaningful predictor. For example, a time factor could be very important. Maybe the experiment setting was outdoors and the climate factors were important. It could also be possible that our experimental setting was indoors and there were always labor conditions. That means, that the date and time are irrelevant. It is not always clear what a “good choice” is.
For task 1, we could have written a very similar function as for task 2. But we wrote a function that is a little bit more complicated than necessary. We did this to demonstrate how the algorithm works. The algorithm calculates a linear regression model for each predictor. With the table, we are able to see that the model with the highest RSQ is not necessarily the model with the lowest AIC.
We also wrote a function for task two. It is a step-forward algorithm. First, the function calculates a bivariate linear regression model and chooses the one with the highest RSQ. After that, a multivariate regression model is calculated, and again, the one with the highest RSQ is chosen. For each round, we compare the AIC. If the AIC is higher than the model before, we stop the calculation, and we have probably found our best-fit model. But here we have the same problem as described above. Our calculation depends on our input. Therefore, we need to consider which variables we include in the calculation.
For demonstration, we made some function calls. We can easily see that the results are different. That means we must (1) choose wisely (maybe with an educational guess), (2) try different calls to be able to estimate how big the difference is, and (3) document how and why we have decided.
Inspired by: https://www.statology.org and the book: “Statistik” form L. Fahrmeir
Residuals vs. fitted plot
We can see the right line follows almost the dashed line. It is near zero and approximately a horizontal line. That means the linearity is not violated and the assumption that we use a linear regression model was ok.
Q-Q-Plot
The Q-Q-Plot shows us that the residuals are almost following a normal distribution. The error is probably mainly statistical and not systematic. The plot also shows that the choice of a linear model was the right one because it follows almost the line with a 45° angle (the values would then have a perfect normal distribution).
Scale-Location Plot
The red line should be roughly horizontal across the plot. If so, then the assumption of homoscedasticity is probably right. The scattering should be random, and there should be no pattern at all. Both are approximately true. The spread of the residuals is roughly equal at all fitted values.
Leverage-Plot
There are no values behind the cook’s distance. That means that there are no significant outliers that distort our analysis.
The plots are approximately the same as for the bivariate model and need no more discussion. Only the scale-location plot needs a few words: The red line is not horizontal. That means there could be a problem with homoscedasticity. But it is not so dramatic and does not need an intervention. But we should keep this in mind.
We can see that RSQ increases quickly for the first few steps. But then it slows down, and suddenly there is no further improvement, and we can stop our model. For the AIC, we can see similar behavior. But AIC decreases instead of increasing. First, the decrease is fast and slows down. In the last step, the AIC increased a bit. That is when we stop our calculation.
The RSQ is based on the variance decomposition theorem, which states that the variance of the dependent variable can be written as the sum of a part of the variance explained by the regression model and the variance of the residuals (unexplained variance). By definition, it increases with additional variables, but the growth flattens out. Actually, we should discuss the adjusted RSQ instead of the RSQ. The RSQ is always a bit higher than the adjusted RSQ. But it does not matter which one we use in the algorithm. The algorithm stops if the AIC increases. If we choose a final model, then we have to use the adjusted RSQ. AIC is a kind of quality measure for the models. It is based on log-likelihood and a correction term that takes into account the number of variables. In the beginning, AIC becomes smaller. When the influence of the correction term becomes too large, the AIC does not get smaller anymore, and we have a basis for decision-making for the model choice.
It is also important to note that RSQ does not simply add up but must be recalculated constantly. This means that for a multivariate model, we cannot simply use the highest RR of the individual bivariate models or the variables with the highest correlation coefficients. This is visible in the figures “visualization of the development of RR and AIC (bivariate). That is why we have to use the AIC for a good model choice, and that is also why we recalculate the RR in our step-forward algorithm in each round.